The Longest Common Subsequence (LCS) problem is a fundamental
problem of sequence comparison. A natural approximation to this
problem is a model in which every pair of letters of two
"sequences" is matched independently of the other pairs with
probability $1/S$, where $S$ is the size of the alphabet. This
model is analogous to a mean-field version of the LCS problem,
which can be solved with a cavity approach. We refine this
approximation here by systematically incorporating correlations
among the matches into the cavity calculation. We obtain a series
of increasingly accurate approximations to the LCS problem, which
we quantify in the large $S$ limit, both with a perturbative
approach and by Monte Carlo simulations. We find that, as happens
in the expansion around mean field for other disordered systems,
the corrections to our approximations depend upon long-ranged
correlation effects which render the large $S$ expansion
non-perturbative.
I. INTRODUCTION

The Longest Common Subsequence (LCS) problem is a simple and
fundamental example of a sequence comparison problem. Such
problems arise in various important situations, ranging from
biology to combinatorics and computational sciences [?]. A
frequent problem of molecular biology is the detection of
evolutionary relationships between different molecules [?]: given
two DNA molecules which evolved from a common ancestor through a
process of random insertions and deletions, how can one recover
the ancestor? A possible answer is to solve a particular instance
of the LCS problem, namely to look for sequences of nucleotides
that appear in the same order in the two DNA molecules, and to
pick such a common subsequence that is as long as possible, i.e.
contains as many nucleotides as possible. Replacing the two DNA
molecules by two general sequences $X = (X_1,\dots,X_N)$ and
$Y = (Y_1,\dots,Y_M)$ (not necessarily of equal lengths) taken
from a given alphabet, one obtains a general instance of the LCS
problem. As one would naturally expect, when $X$ and $Y$ are very
long sequences whose elements are taken at random independently
from an alphabet of $S$ letters (with $S \ge 2$), there is a
definite density of matched points in an LCS of $X$ and $Y$. More
precisely, if $L_N$ denotes the length (the number of letters) of
an LCS of $(X_1,\dots,X_N)$ and $(Y_1,\dots,Y_N)$, one can prove
(see e.g. [?]) that with probability one, $L_N/N$ tends to a
non-random constant $\gamma_S$ as $N \to \infty$. The
determination of $\gamma_S$ and of the rate at which $L_N/N$
approaches this limit are much studied combinatorial problems [?].

A connection with statistical physics has been provided by Hwa
and Lässig [?], who found that Needleman-Wunsch sequence
alignment, a popular comparison scheme for DNA and proteins of
which the LCS problem is a special case, falls in the universality
class of directed polymers in a random medium. This connection is
based on a geometric interpretation (explained in the next
section) of the LCS problem as a longest path problem [?]. The
randomness in the above "Random String" model can be encoded in
variables $\epsilon_{ij}$ defined as occupation numbers for the
matches of $X$ and $Y$, namely
$\epsilon_{ij} = \delta_{X_i,Y_j} = 1$ if $X_i = Y_j$ and $0$
otherwise. The presence of long-ranged correlations among the
matches (for example, given any indices $i_1, j_1, i_2, j_2$, the
variables $\epsilon_{i_1j_1}$, $\epsilon_{i_1j_2}$,
$\epsilon_{i_2j_1}$, $\epsilon_{i_2j_2}$ are obviously correlated)
complicates the problem considerably, and to date the computation
of the average length of an LCS has turned out to be intractable.
In [?], we studied a related "Bernoulli Matching" model where the
$\epsilon_{ij}$'s are taken to be independent and identically
distributed random variables with
$P(\epsilon_{ij} = 1) = 1 - P(\epsilon_{ij} = 0) = 1/S$. It turns
out that this model is closely analogous to a mean-field version
of the LCS problem, which can be solved using a cavity approach.
This solution was found to provide a very good approximation
(whose precision improves as the size of the alphabet increases)
to the average LCS length of two random strings measured from
direct Monte Carlo simulations.

We pursue here the work of [?] by studying the behaviour of the
above "mean field" approximation in the limit of large alphabets.
We describe a method which allows one to refine the cavity
calculation made for the Bernoulli Matching model, by taking
correlations of the Random String model into account in a
systematic way. This leads to a series of approximations getting
closer and closer to the LCS problem, which we quantify within a
perturbative approach valid in the limit $S \to \infty$. We find
that, while our perturbative approach provides an excellent
approximation to the LCS problem at finite $S$, it leads to a
singular expansion (in powers of $1/\sqrt{S}$) around the
Bernoulli Matching model. In particular, the leading corrections
to this mean-field approximation depend upon long-ranged
correlation effects among the matches and cannot be captured by
the method we use.



II. THE CAVITY SOLUTION TO THE 
BERNOULLI MATCHING MODEL 

Consider the lattice $C_{NM}$ formed by the integer points
$(i,j)$, $0 \le i \le N$, $0 \le j \le M$, together with
nearest-neighbor bonds, and add a diagonal bond
$\{(i-1,j-1),(i,j)\}$ for each point $(i,j)$ such that
$\epsilon_{ij} = 1$ (we call such a point a match). Define the
weight of any path on $C_{NM}$ to be the number of diagonal bonds
that it contains, and let $L_{ij}$ be the maximum possible weight
of a directed path joining the point $(0,0)$ to $(i,j)$. In the
Random String model $L_{ij}$ is just the length of an LCS of the
substrings $(X_1,\dots,X_i)$ and $(Y_1,\dots,Y_j)$. Setting
$L_{i,0} = L_{0,j} = 0$, the $L_{ij}$'s satisfy the following
recursion relation:

\[ L_{ij} = \max\bigl(L_{i-1,j},\; L_{i,j-1},\; L_{i-1,j-1} + \epsilon_{ij}\bigr), \tag{1} \]

which follows from the fact that any directed path ending at
$(i,j)$ must visit one of the points $(i-1,j)$, $(i,j-1)$ or
$(i-1,j-1)$. It turns out to be more convenient to work with the
local gradient variables $\nu_{ij} = L_{ij} - L_{i-1,j}$ and
$\mu_{ij} = L_{ij} - L_{i,j-1}$, rather than with $L_{ij}$ itself.
It is obvious from (1) that $\nu_{ij}$ and $\mu_{ij}$ can take
only the values 0 or 1. Writing $\bar{x} = 1 - x$ for
$x \in \{0,1\}$, the recursion relations for $\nu_{ij}$ and
$\mu_{ij}$ can be written in algebraic form:

\[ \nu_{ij} = (1 - \bar\epsilon_{ij}\,\bar\nu_{i,j-1})\,\bar\mu_{i-1,j}, \qquad
   \mu_{ij} = (1 - \bar\epsilon_{ij}\,\bar\mu_{i-1,j})\,\bar\nu_{i,j-1}, \tag{2} \]

with $\nu_{i,0} = \nu_{0,i} = \mu_{i,0} = \mu_{0,i} = 0$. The key
property which was used (but left unjustified) in [?] is that in
the Bernoulli Matching model the variables $\nu_{ij}$ and
$\mu_{ij}$ along $i + j = t$ become independent in the limit
$t \to \infty$. This can be viewed as a consequence of the
directed polymer picture of [?], if we interpret $L_{ij}$ as the
height profile $L(x,t)$ (as a function of $x = i - j$ and
$t = i + j$) of a growing 1D interface, described in a continuum
limit by the Kardar-Parisi-Zhang (KPZ) equation. In this limit it
is known [?] that the gradients of $L(x,t)$ become decorrelated
along $x$ as $t \to \infty$ [?]. The $\nu_{ij}$'s and
$\mu_{ij}$'s could still have finite-ranged correlations along the
$x$ direction at the discrete level of the model; however, this
does not happen here. This can be seen from a Markov chain
approach which we present in the appendix. The consequence of this
decorrelation property is that we can use eqs. (2) in a
self-consistent way in order to compute the probabilities
$p_{ij} = P(\nu_{ij} = 1)$ and $p'_{ij} = P(\mu_{ij} = 1)$ for
$i, j$ large. In this sense we may view the Bernoulli Matching
model as a mean-field model in which (2) are "cavity equations"
[?]. Assuming independence of $\nu_{i,j-1}$, $\mu_{i-1,j}$ and
$\epsilon_{ij}$ in (2) we get

\[ p_{ij} = 1 - p'_{i-1,j} - (1 - 1/S)(1 - p_{i,j-1})(1 - p'_{i-1,j}), \]
\[ p'_{ij} = 1 - p_{i,j-1} - (1 - 1/S)(1 - p_{i,j-1})(1 - p'_{i-1,j}). \tag{3} \]
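The recursion (1) and its gradient form (2) are easy to check on small instances. The following is a minimal sketch, not the code used for the simulations reported in this paper: the function names (`lcs_table`, `string_matches`, `bernoulli_matches`, `gradients_satisfy_eq2`) are hypothetical, and the match matrix is stored with zero-based indices, so that `eps[i-1][j-1]` plays the role of $\epsilon_{ij}$.

```python
import random

def lcs_table(eps):
    """Fill L[i][j] with recursion (1); eps[i-1][j-1] is the 0/1 match variable at (i, j)."""
    n, m = len(eps), len(eps[0])
    L = [[0] * (m + 1) for _ in range(n + 1)]  # boundary: L[i][0] = L[0][j] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            L[i][j] = max(L[i - 1][j], L[i][j - 1],
                          L[i - 1][j - 1] + eps[i - 1][j - 1])
    return L

def string_matches(x, y):
    """Random String model: eps_ij = 1 exactly when X_i = Y_j."""
    return [[1 if a == b else 0 for b in y] for a in x]

def bernoulli_matches(n, m, S, rng):
    """Bernoulli Matching model: independent eps_ij with P(eps_ij = 1) = 1/S."""
    return [[1 if rng.random() < 1.0 / S else 0 for _ in range(m)]
            for _ in range(n)]

def gradients_satisfy_eq2(eps):
    """Check that nu, mu built from L are 0/1 valued and obey relations (2)."""
    L = lcs_table(eps)
    n, m = len(eps), len(eps[0])

    def nu(i, j):  # gradient along i, set to zero on the i = 0 boundary
        return L[i][j] - L[i - 1][j] if i > 0 else 0

    def mu(i, j):  # gradient along j, set to zero on the j = 0 boundary
        return L[i][j] - L[i][j - 1] if j > 0 else 0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            eps_bar = 1 - eps[i - 1][j - 1]
            nu_bar, mu_bar = 1 - nu(i, j - 1), 1 - mu(i - 1, j)
            if nu(i, j) not in (0, 1) or mu(i, j) not in (0, 1):
                return False
            if nu(i, j) != (1 - eps_bar * nu_bar) * mu_bar:
                return False
            if mu(i, j) != (1 - eps_bar * mu_bar) * nu_bar:
                return False
    return True
```

For instance, `lcs_table(string_matches("ABCBDAB", "BDCABA"))[-1][-1]` evaluates to 4, the length of the common subsequence "BCBA". The same table routine serves both models, since they differ only in how the match variables are drawn.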

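These equations can be solved in a continuum limit [?], leading
to

\[ p(r) = \frac{\sqrt{rS} - 1}{S - 1}, \qquad p'(r) = \frac{\sqrt{S/r} - 1}{S - 1}, \tag{4} \]

where $p(r) = \lim_{i\to\infty} p_{i,ri}$ and
$p'(r) = \lim_{i\to\infty} p'_{i,ri}$, and

\[ \gamma_S^B(r) = \lim_{N\to\infty} \frac{L_{N,rN}}{N} = p(r) + r\,p'(r) = \frac{2\sqrt{rS} - r - 1}{S - 1}. \tag{5} \]

Note that (4) and (5) are only valid for $1/S \le r \le S$. If
$r > S$, resp. $r < 1/S$, the process evolves towards the state
$(p,p') = (1,0)$, resp. $(p,p') = (0,1)$ (this is a "percolation
transition" of the LCS problem [?]).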
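As a quick numerical sanity check, one can verify that the closed forms (4) make the averaged recursion (3) stationary, i.e. that $p_{ij}$ and $p'_{ij}$ reproduce themselves when they no longer vary from site to site. The sketch below assumes this local stationarity (a simplification of the continuum limit, not the derivation itself); the function names are hypothetical.

```python
import math

def cavity_densities(r, S):
    """Eq. (4): stationary probabilities p(r), p'(r), valid for 1/S <= r <= S."""
    if not (1.0 / S <= r <= S):
        raise ValueError("closed form only valid for 1/S <= r <= S")
    p = (math.sqrt(r * S) - 1.0) / (S - 1.0)
    p_prime = (math.sqrt(S / r) - 1.0) / (S - 1.0)
    return p, p_prime

def gamma_bernoulli(r, S):
    """Eq. (5): gamma_S^B(r) = p(r) + r p'(r)."""
    p, p_prime = cavity_densities(r, S)
    return p + r * p_prime

def stationarity_residual(p, p_prime, S):
    """First equation of (3) with site-independent p, p': right-hand side minus p."""
    return 1.0 - p_prime - (1.0 - 1.0 / S) * (1.0 - p) * (1.0 - p_prime) - p
```

At $r = 1$ the function `gamma_bernoulli` reduces to $2/(1+\sqrt{S})$, and at $r = S$ the densities reach the boundary state $(p,p') = (1,0)$.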



The author is grateful to R. Bundschuh for pointing this 
out to him. 
FIG. 1. Scaling of $\epsilon_S = \gamma_S^B - \gamma_S$ with $S$.
Log-log plot for $20 \le S \le 130$ (error bars not reproduced),
together with a reference line of slope $-3/2$.



III. BERNOULLI MATCHING MODEL VERSUS 
RANDOM STRING MODEL 

Let us briefly compare eq. (5) to the numerical estimates
obtained for the Random String model. For simplicity we shall
restrict ourselves to the case $r = 1$ (random strings of equal
sizes). Using Monte Carlo simulations and a finite size scaling
analysis [?], it was found that the relative error
$(\gamma_S^B - \gamma_S)/\gamma_S$ (with
$\gamma_S^B = \gamma_S^B(r = 1) = 2/(1 + \sqrt{S})$) is about
$+2\%$ for $S = 2$ and $S = 3$, and decreases for
$4 \le S \le 15$ (it is about $+0.9\%$ for $S = 15$). Figure 1
reproduces the behaviour of the difference
$\epsilon_S = \gamma_S^B - \gamma_S$ in a log-log plot for $S$ up
to 130. Numerically $\epsilon_S$ decreases rather fast at large
$S$, showing a $1/S^{\alpha}$ dependence for a value of $\alpha$
compatible with $3/2$. We remark that a simple expansion holds
for the Bernoulli Matching model, as we have
$S\gamma_S^B/(2\sqrt{S} - 2) = 1/(1 - 1/S) = 1 + 1/S + 1/S^2 + \dots$.
Anticipating a similar expansion for the Random String model, we
would expect the corrections arising on the left-hand side of
this relation to occur in the $1/S$ term.
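The order of this correction can be made explicit with a short worked computation. It is only a sketch under the assumption that $\epsilon_S \simeq c\,S^{-3/2}$ at large $S$, where $c$ is a hypothetical constant standing for the numerically observed prefactor (its value is not determined here):

```latex
\[
\frac{S\,\gamma_S}{2\sqrt{S}-2}
  = \frac{S\,\gamma_S^B}{2\sqrt{S}-2} - \frac{S\,\epsilon_S}{2\sqrt{S}-2}
  = \frac{1}{1-1/S} - \frac{c}{2S}\,\bigl(1 + O(S^{-1/2})\bigr).
\]
```

Under this assumption an $S^{-3/2}$ gap between the two models shifts the coefficient of the $1/S$ term and leaves the leading term unchanged.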

IV. INCORPORATION OF CORRELATIONS 

We now come to the question of computing corrections to the
above approximation, by incorporating some of the correlations of
the Random String model in our calculation. This can be done in a
systematic way as follows. We iterate relations (2) a certain
number of times, say $k$. The resulting equations are averaged,
taking into account correlations among the $\epsilon_{ij}$'s, to
build up the transition probabilities of a Markov process which
we use as a refined approximation to the LCS problem. This
approach is similar to the $n$-tree approximations which were
used by Cook and Derrida to obtain a $1/d$ expansion for the
directed polymer problem on finite dimensional lattices [?]. We
note however that the Bernoulli Matching model is very different
from a model of directed polymers on a hierarchical lattice, and
the word "tree" would be somewhat misleading here. In order to
analyse the above new process we use a perturbative approach,
assuming that the variables $\nu_{ij}$ and $\mu_{ij}$ for
$i + j = t$ are independent in the stationary distribution, as
they are in the Bernoulli Matching model. It is then
straightforward to obtain a self-consistent equation for
$p = \lim_{i\to\infty} p_{ii}$ in the form $f_S^{(k)}(p) = p$,
where $f_S^{(k)}(p)$ is an $S$-dependent polynomial of degree
$2k$ in $p$. The positive solution $p_S^{(k)}$ to this equation
provides a new approximation $\gamma_S^{(k)} = 2p_S^{(k)}$ to
$\gamma_S$. Since there are no 3-term correlations in the Random
String model (correlations among the $\epsilon_{ij}$'s occur only
for configurations forming loops on the square lattice, e.g. in
the four corners of a rectangle), it follows that no correlation
in the disorder occurs at level $k = 2$, so $\gamma_S^{(k)}$
differs from $\gamma_S^B$ only for $k \ge 3$. An explicit
computation shows that the equation $f_S^{(k)}(p) = p$ has only
one positive root, at least up to $k = 5$. The corresponding
values of $\gamma_S^{(k)}$ thus provide sensible perturbative
approximations to $\gamma_S$, which are reproduced in figure 2.
Note that the estimates are improving, at least up to $k = 5$ for
$S \ge 3$. The successive values of $\gamma_2^{(k)}$ are not
incompatible with a non-monotonic approach to $\gamma_2$. The
relative error $(\gamma_S^{(k)} - \gamma_S)/\gamma_S$ at $k = 5$
is $-0.48\%$ for $S = 2$ and $+0.28\%$ for $S = 3$, a significant
improvement compared to the error committed with the Bernoulli
Matching estimate $\gamma_S^B$. This approximation scheme would
be perfectly consistent if a decorrelation property occurred at
every level $k$. This is in fact not the case: for example, one
can show that in the invariant distribution of the process at
level $k = 3$, the variables $\nu_{ij}$ and $\mu_{ij}$ are
necessarily correlated. In the KPZ picture we may say that for
$k \ge 3$, there remain as $t \to \infty$ short-ranged
correlations along the $x$-direction in the local gradients of
the growing interface's height. However, these correlations turn
out to be numerically very small, which explains why our
perturbative approach already gives a fairly accurate result at
$S = 2, 3$. Moreover, when $S$ becomes large this approach
becomes more and more accurate, as the exact invariant
distribution resembles more and more that of the Bernoulli
Matching model, and we expect that the leading corrections
introduced at a given level $k$ are captured by this
approximation. We now evaluate the behaviour of $\gamma_S^{(k)}$
as $S \to \infty$. This evaluation involves comparing
$f_S^{(k)}(p)$ to the analogous polynomial $f_S^{B(k)}(p)$
computed within the Bernoulli
Matching model. The coefficients of
$\delta f_S^{(k)} = f_S^{B(k)} - f_S^{(k)}$ are directly related
to correlations among the $\epsilon_{ij}$'s. For example the
computation of $\delta f_S^{(3)}$ involves the 4-correlation term
$\langle \bar\epsilon_{i_1j_1}\bar\epsilon_{i_1j_2}\bar\epsilon_{i_2j_1}\bar\epsilon_{i_2j_2}\rangle = (1 + 1/(S-1)^3)(1 - 1/S)^4$,
and we have

\[ \delta f_S^{(3)}(p) = \frac{1}{(S-1)^3}\Bigl(1 - \frac{1}{S}\Bigr)^{4} (1 - p)^2 \bigl(1 - f^{(1)}(p)\bigr)^2 \tag{6} \]

with $f^{(1)}(p) = f_S^{B(1)}(p) = 1 - p - (1 - 1/S)(1 - p)^2$.
The coefficients of $\delta f_S^{(k)}$ all turn out to be of
order $O(1/S^3)$ or smaller. For completeness we also give the
expression of the polynomial $f_S^{B(3)}(p)$, which reads

\[ f_S^{B(3)}(p) = 1 - f^{(2)}(p) - \Bigl(1 - \frac{1}{S}\Bigr)\bigl(f^{(1)}(p) - p(1 - p)\bigr)
   - 2\Bigl(1 - \frac{1}{S}\Bigr)^{2} p(1 - p)\bigl(1 - f^{(1)}(p)\bigr) \]
\[ \qquad\qquad - \Bigl(1 - \frac{1}{S}\Bigr)^{3} \bigl(1 - f^{(1)}(p)\bigr)^2 \Bigl(p^2 + \Bigl(1 - \frac{1}{S}\Bigr)(1 - p)^2\Bigr), \tag{7} \]

where $f^{(2)}(p) = f_S^{B(2)}(p) = f^{(1)} \circ f^{(1)}(p)$.

FIG. 2. Perturbative approximations to $\gamma_S$. This is a bar
graph: for each $2 \le S \le 10$, the first to fifth bars from
left to right give respectively the values of
$\gamma_S^B = \gamma_S^{(2)}$, $\gamma_S^{(3)}$, $\gamma_S^{(4)}$,
$\gamma_S^{(5)}$, and our numerical estimate of $\gamma_S$.

If we now let $\delta p_S^{(k)} = p_S^B - p_S^{(k)}$, where
$p_S^{B}$ is the positive solution to $f_S^{B(k)}(p) = p$, i.e.
$p_S^{B(k)} = p_S^B = 1/(1 + \sqrt{S})$, a standard computation
leads to

\[ \delta p_S^{(k)} \simeq \frac{\delta f_S^{(k)}(p_S^B)}{1 - f_S^{B(k)\prime}(p_S^B)} \tag{8} \]

up to negligible terms at large $S$. It can be checked that
$f_S^{B(k)\prime}(p_S^B) = 1 - 2k/\sqrt{S} + O(1/S)$. It follows
then from (8) that for fixed $k$, the correction
$\delta p_S^{(k)}$ is of order $O(1/S^{5/2})$, which cannot
account for the observed $1/S^{3/2}$ behaviour of
$\gamma_S^B - \gamma_S$. The computation gives
$\delta p_S^{(3)} \simeq 1/(6S^{5/2})$,
$\delta p_S^{(4)} \simeq 1/(2S^{5/2})$, and
$\delta p_S^{(5)} \simeq 1/S^{5/2}$, together with correcting
terms in the form of series in powers of $1/\sqrt{S}$. We could
not extract the general terms of these series for arbitrary $k$,
but we strongly suspect that (at least) the coefficient $A_k$ in
front of $1/S^{5/2}$ diverges at large $k$. The argument goes
roughly as follows. The correlation terms involved in the
computation of $\delta f_S^{(k)}(p)$ can all be put into the form



\[ \Bigl\langle \prod_{a=1}^{l} \bar\epsilon_{i_a j_a} \Bigr\rangle = \bigl\langle (1 - \epsilon_{i_1j_1}) \cdots (1 - \epsilon_{i_lj_l}) \bigr\rangle. \tag{9} \]

Consider any such correlation term, and let us expand the product
and take averages, up to order $1/S^3$. It is not difficult to
see that the result is

\[ \bigl\langle (1 - \epsilon_{i_1j_1}) \cdots (1 - \epsilon_{i_lj_l}) \bigr\rangle
   = 1 - \frac{l}{S} + \binom{l}{2}\frac{1}{S^2} - \binom{l}{3}\frac{1}{S^3} + \frac{n_{i_1j_1 \dots i_lj_l}}{S^3} + O\Bigl(\frac{1}{S^4}\Bigr) \]
\[ \qquad\qquad = \Bigl(1 - \frac{1}{S}\Bigr)^{l} \Bigl(1 + \frac{n_{i_1j_1 \dots i_lj_l}}{S^3} + O\Bigl(\frac{1}{S^4}\Bigr)\Bigr), \tag{10} \]

where $n_{i_1j_1 \dots i_lj_l}$ is the number of rectangles that
can be formed with 4 corners on the graph made up by the lattice
points $(i_1j_1), \dots, (i_lj_l)$. At level $k$ we have to
consider rectangles formed on the triangular lattice $\Lambda_k$
made up by the points $(i,j)$ such that $0 < i, j \le k$ and
$i + j > k$. The number $n_k$ of these rectangles satisfies the
recursion relation $n_k = 2n_{k-1} - n_{k-2} + k(k+1)/2$, which
in the large $k$ limit gives $d^2 n_k/dk^2 \simeq k^2/2$, leading
to $n_k \simeq k^4/24$ (from a more precise computation, taking
account of $n_0 = 0$ and $n_1 = 1$, one finds that
$n_k = \frac{1}{24}k^4 + \frac{1}{4}k^3 + \frac{11}{24}k^2 + \frac{1}{4}k$).
All these rectangles are involved in $A_k$, as for example the
polynomial $\delta f_S^{(k)}(p)$ always contains a term of the
form:

\[ \Bigl( \Bigl\langle \prod_{(ij) \in \Lambda_k} \bar\epsilon_{ij} \Bigr\rangle - \Bigl(1 - \frac{1}{S}\Bigr)^{k(k+1)/2} \Bigr) (1 - p)^{2k} \tag{11} \]

which gives a contribution $n_k/S^3 + O(1/S^{7/2})$ to
$\delta f_S^{(k)}(p_S^B)$. Unless some special cancellation
occurs between the different correlation terms, we thus expect
that the behaviour of $A_k$ will be approximately given by
$A_k \propto n_k/2k$, i.e. we find that it diverges like $k^3$ at
large $k$. We conclude that the leading corrections to the
Bernoulli Matching model in a large $S$ expansion depend on
long-ranged correlations among the matches in the Random String
model. In order to capture them we should look at arbitrarily
large values of $k$, for which the above perturbative approach is
no longer valid.
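The level-3 estimate quoted in this section can be checked by brute force. The sketch below is an illustration under stated assumptions, not the computation performed for this paper: it iterates relations (2) three times on the six cells feeding the site $(3,3)$, averages over independent boundary gradients (each equal to 1 with probability $p$) and independent matches to obtain the Bernoulli polynomial, takes the level-3 correction to be $\delta f_S^{(3)}(p) = (S-1)^{-3}(1-1/S)^4(1-p)^2(1-f^{(1)}(p))^2$, and evaluates the linearized formula (8) with a numerical derivative. All function names are hypothetical.

```python
import itertools
import math

def cell(eps, nu_in, mu_in):
    """One application of relations (2); returns (nu_ij, mu_ij)."""
    eps_bar = 1 - eps
    nu_out = (1 - eps_bar * (1 - nu_in)) * (1 - mu_in)
    mu_out = (1 - eps_bar * (1 - mu_in)) * (1 - nu_in)
    return nu_out, mu_out

def nu33(inputs, eps):
    """nu at site (3,3) from six boundary gradients and the six matches of the triangle."""
    nu30, mu21, nu21, mu12, nu12, mu03 = inputs
    e31, e22, e13, e32, e23, e33 = eps
    nu31, _ = cell(e31, nu30, mu21)
    nu22, mu22 = cell(e22, nu21, mu12)
    _, mu13 = cell(e13, nu12, mu03)
    nu32, _ = cell(e32, nu31, mu22)
    _, mu23 = cell(e23, nu22, mu13)
    return cell(e33, nu32, mu23)[0]

def f3_bernoulli(p, S):
    """Exact average of nu33 over independent inputs and independent matches."""
    total = 0.0
    for inputs in itertools.product((0, 1), repeat=6):
        w_in = math.prod(p if b else 1.0 - p for b in inputs)
        for eps in itertools.product((0, 1), repeat=6):
            w_eps = math.prod(1.0 / S if e else 1.0 - 1.0 / S for e in eps)
            total += w_in * w_eps * nu33(inputs, eps)
    return total

def delta_p3(S, h=1e-5):
    """Linearized shift of the fixed point at level 3, following eq. (8)."""
    p_star = 1.0 / (1.0 + math.sqrt(S))
    f1 = 1.0 - p_star - (1.0 - 1.0 / S) * (1.0 - p_star) ** 2
    # Assumed form of the level-3 correction (single rectangle of matches).
    delta_f = ((1.0 - 1.0 / S) ** 4 / (S - 1.0) ** 3
               * (1.0 - p_star) ** 2 * (1.0 - f1) ** 2)
    slope = (f3_bernoulli(p_star + h, S) - f3_bernoulli(p_star - h, S)) / (2.0 * h)
    return delta_f / (1.0 - slope)
```

With these assumptions, `f3_bernoulli` leaves $p = 1/(1+\sqrt{S})$ fixed, and `6 * S**2.5 * delta_p3(S)` approaches 1 as $S$ grows, in line with the $1/(6S^{5/2})$ behaviour quoted for $k = 3$. The enumeration over $2^{12}$ configurations is chosen for transparency rather than speed.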



V. CONCLUSION 

The main point of this paper is that, while the Bernoulli
Matching model provides a natural and accurate mean-field-like
approximation to the LCS problem valid in the limit of a large
alphabet, the corresponding large $S$ expansion is
non-perturbative: inclusion of finite-ranged correlations leads
to a series with diverging coefficients, while the overall
behaviour of the expansion at large $S$ does not reproduce the
observed gap between the two models. This contrasts with the
results of [?], where the $n$-tree approximation led to a
consistent $1/d$ expansion for the directed polymer problem. As
already pointed out, we are dealing here with a different kind of
mean-field approximation. The Bernoulli Matching model is not an
infinite dimensional model, and replica symmetry is not broken in
this model [?]. Note that the $1/d$ expansion for the directed
polymer problem is also known to be singular, but in a more
subtle way: replica symmetry is restored at finite dimensions,
leading to important "tunnelling" effects between the energy
valleys of the mean-field picture [?]. An interesting feature of
the LCS problem is that the corrections induced by finite-ranged
correlations, while singular, remain within a perturbative series
in powers of $1/\sqrt{S}$. In this respect the situation for the
LCS problem seems more favourable than in other combinatorial
problems where the correlations in the disorder induce
non-perturbative corrections in an expansion around the
mean-field approximation [?]. This makes the LCS problem an
interesting model for investigating this kind of singularity.



APPENDIX: MARKOV CHAIN APPROACH TO
THE BERNOULLI MATCHING MODEL

Let us denote by
$(\nu_t, \mu_t) = \{\nu_{ij}, \mu_{ij},\ i + j = t\}$ the state
of the process defined by (2), with $t$ interpreted as time. In
the Bernoulli Matching model the evolution is Markovian, i.e. the
transition from a given state at time $t$ to another state at
time $t + 1$ is not affected by the states at times $t' < t$
(this is not the case in the Random String model, where the
transition from time $t$ to time $t + 1$ is affected by the whole
history of the process). We will first show that, as a Markov
process, the Bernoulli Matching model admits invariant
distributions in which the components of $(\nu_t, \mu_t)$ are
completely decorrelated (we mention that the same result has been
found in a different way in [?], in the case $N = M$). Consider
the relations (2) restricted to a given cell of the lattice
$C_{NM}$. The corresponding "one-cell" transition probability
$P_1(\nu, \mu | \nu', \mu')$ is given by

\[ P_1(\nu, \mu | \nu', \mu') = \frac{1}{S}\,\nu\mu\,\bar\nu'\bar\mu' + \nu\bar\mu\,\nu'\bar\mu' + \bar\nu\mu\,\bar\nu'\mu'
   + \bar\nu\bar\mu\,\Bigl(\nu'\mu' + \Bigl(1 - \frac{1}{S}\Bigr)\bar\nu'\bar\mu'\Bigr). \tag{12} \]

A simple computation shows that the one-cell Perron-Frobenius
equation $P_1 \pi_1 = \pi_1$ (with matrix notations) has a
solution of the form
$\pi_1(\nu, \mu) = (p\nu + (1 - p)\bar\nu)(p'\mu + (1 - p')\bar\mu)$
provided the probabilities $p$ and $p'$ satisfy

\[ 1 = p + p' + (S - 1)\,p\,p'. \tag{13} \]
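The invariance of the product measure under the one-cell dynamics can be verified exactly with rational arithmetic. The following is a minimal sketch (hypothetical function names): it pushes a product distribution on the incoming pair through relations (2) and compares the result with the same product distribution on the outgoing pair.

```python
from fractions import Fraction
from itertools import product

def one_cell_step(nu_in, mu_in, eps):
    """Relations (2) on a single cell: incoming pair and match give the outgoing pair."""
    eps_bar = 1 - eps
    nu_out = (1 - eps_bar * (1 - nu_in)) * (1 - mu_in)
    mu_out = (1 - eps_bar * (1 - mu_in)) * (1 - nu_in)
    return nu_out, mu_out

def product_measure(p, p_prime):
    """pi_1(nu, mu): independent occupations with probabilities p and p'."""
    return {(nu, mu): (p if nu else 1 - p) * (p_prime if mu else 1 - p_prime)
            for nu, mu in product((0, 1), repeat=2)}

def push_forward(p, p_prime, S):
    """Distribution of the outgoing pair when the incoming pair follows pi_1."""
    out = {state: Fraction(0) for state in product((0, 1), repeat=2)}
    incoming = product_measure(p, p_prime)
    for (nu_in, mu_in), w in incoming.items():
        for eps in (0, 1):
            w_eps = Fraction(1, S) if eps else 1 - Fraction(1, S)
            out[one_cell_step(nu_in, mu_in, eps)] += w * w_eps
    return out

def p_prime_from_p(p, S):
    """Solve condition (13), 1 = p + p' + (S - 1) p p', for p'."""
    return (1 - p) / (1 + (S - 1) * p)
```

For example, with $S = 4$ and $p = 1/5$, condition (13) gives $p' = 1/2$, and `push_forward` returns exactly `product_measure(p, p')`. A pair violating (13), such as $p = p' = 1/2$, is not reproduced.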



Suppose now that we let the bonds on the lower corner of any
given rectangle be occupied independently with probability $p$
for horizontal bonds and $p'$ for vertical bonds. A moment's
thought shows that the same distribution will hold for the upper
corner bonds if we let the occupation numbers for the bonds
inside the rectangle evolve according to (2), as long as $p$ and
$p'$ satisfy (13). Hence any solution of (13) provides a
decorrelated invariant distribution, as was claimed. In a
continuum limit, these invariant distributions can be identified
locally with the "pure" invariant distributions of the process,
i.e. those invariant distributions evolved from a single initial
state of the variables $\nu, \mu$. More precisely, let us impose
periodic boundary conditions along the $x = i - j$ direction
(this is a way of working "locally"). We let
$\nu^t = (\nu_1^t, \dots, \nu_L^t)$ and
$\mu^t = (\mu_1^t, \dots, \mu_L^t)$ be the state variables at
time $t$ on a band of "width" $L$, and we adopt a numbering such
that (2) reads

\[ \nu_i^{t+1} = (1 - \bar\epsilon_i^{\,t+1}\,\bar\nu_i^{\,t})\,\bar\mu_{i+1}^{\,t}, \qquad
   \mu_i^{t+1} = (1 - \bar\epsilon_i^{\,t+1}\,\bar\mu_{i+1}^{\,t})\,\bar\nu_i^{\,t}, \tag{14} \]

for $i = 1, \dots, L$ ($L + 1$ being identified with 1). From the
remarks made above, the Perron-Frobenius equation
$P_L \pi_L = \pi_L$ of this process admits solutions of the form
$\pi_L^{(p,p')}(\nu_1, \mu_1, \dots, \nu_L, \mu_L) = \prod_{i=1}^{L} \pi_1(\nu_i, \mu_i)$,
where again $(p, p')$ is any solution of (13). For finite $L$
however $\pi_L^{(p,p')}$ is not pure. To get an understanding of
the pure distributions we adopt a lattice gas point of view,
remarking that the quantity

\[ C = \sum_{i=1}^{L} (\mu_i - \nu_i) \tag{15} \]

is a conserved charge of the evolution (this conservation law is
exact only under the above periodic boundary conditions). It can
also be checked that the Markov process defined by (14) connects
any two states $(\nu_i, \mu_i)$, $(\nu'_i, \mu'_i)$ having the
same charge $C$. It follows that there are exactly $2L + 1$ pure
distributions, in correspondence with the possible values
$-L \le C \le L$. We can extract a formal expression for the pure
invariant distribution $\pi_C(\nu_i, \mu_i)$ evolved from an
arbitrary state of charge $C$, from the "mixed" invariant
distribution $\pi_L^{(p,p')}$. Namely we have

\[ \pi_C(\nu_i, \mu_i) = \frac{1}{Z(C)}\; \delta_{C,\,\sum_i (\mu_i - \nu_i)}\; \pi_L^{(p,p')}(\nu_1, \mu_1, \dots, \nu_L, \mu_L), \tag{16} \]

where $Z(C)$ is a normalization factor. Note that, contrary to
appearances, the right-hand side of (16) does not depend on
$(p, p')$ (this can be seen directly from the expression of
$\pi_L^{(p,p')}$ by making use of (13)). In the limit
$L \to \infty$, the fluctuations of $C$ about its mean value
$L(p' - p)$ with respect to $\pi_L^{(p,p')}$ become negligible,
and we expect that the differences between the pure distributions
$\pi_C$ (more precisely the differences between their finite
correlation functions) for which $C/L$ is close to $p' - p$ will
become insignificant. This can be checked directly from (16)
using a saddle-point evaluation.
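The conservation law can be illustrated with a small simulation of the periodic band. This is a sketch under an assumed indexing convention, namely that cell $i$ reads $\nu_i$ and $\mu_{i+1}$ and writes the new $\nu_i$ and $\mu_i$; the charge is taken to be $\sum_i(\mu_i - \nu_i)$, and the function names are hypothetical. The conservation itself only relies on each cell mapping one incoming pair to one outgoing pair.

```python
import random

def band_step(nu, mu, S, rng):
    """One synchronous update of a periodic band of width L = len(nu)."""
    L = len(nu)
    new_nu, new_mu = [0] * L, [0] * L
    for i in range(L):
        eps_bar = 0 if rng.random() < 1.0 / S else 1  # 1 - eps for this cell
        nu_in, mu_in = nu[i], mu[(i + 1) % L]          # periodic neighbour
        new_nu[i] = (1 - eps_bar * (1 - nu_in)) * (1 - mu_in)
        new_mu[i] = (1 - eps_bar * (1 - mu_in)) * (1 - nu_in)
    return new_nu, new_mu

def charge(nu, mu):
    """Candidate conserved charge: sum over the band of mu_i - nu_i."""
    return sum(mu) - sum(nu)

def occupation_odds(p, S):
    """p p' / ((1 - p)(1 - p')) with p' fixed by 1 = p + p' + (S - 1) p p'."""
    p_prime = (1.0 - p) / (1.0 + (S - 1.0) * p)
    return p * p_prime / ((1.0 - p) * (1.0 - p_prime))
```

Running `band_step` repeatedly from any initial state leaves `charge` unchanged. The helper `occupation_odds` returns $1/S$ for every admissible $p$, which is one way to see why the weights of the product measure restricted to a fixed charge do not depend on $(p, p')$.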
Hence the pure distributions on a periodic band of infinite width
can be identified with a continuum of decorrelated distributions
$\pi^{(p,p')}$, parametrized by the solutions of (13) for which
$0 \le p, p' \le 1$. Returning to the original lattice $C_{NM}$,
one must take care of the boundary conditions imposed along the
axes $i = 0$ and $j = 0$. The process will now develop only
locally (i.e. along any given direction) according to a
distribution of the form $\pi^{(p,p')}$, with $p$ and $p'$ being
functions of $r = j/i$. The cavity approach of section II allows
one to treat different boundary conditions in a simple way. For
example, if the horizontal bonds along $j = 0$ (resp. the
vertical bonds along $i = 0$) are supposed to be occupied
independently with probability $p_1$ (resp. $p_2$), where
$0 \le p_1, p_2 < 1/(1 + \sqrt{S})$ (the original problem
corresponds to the case $p_1 = p_2 = 0$), one finds that

\[ p(r) = \begin{cases}
   p_1, & 0 < r < r_1, \\[4pt]
   \dfrac{\sqrt{rS} - 1}{S - 1}, & r_1 < r < r_2, \\[8pt]
   \dfrac{1 - p_2}{1 + (S - 1)p_2}, & r > r_2,
   \end{cases} \]

where $r_1$ and $r_2$ are such that
$p_1 = (\sqrt{r_1 S} - 1)/(S - 1)$,
$p_2 = (\sqrt{S/r_2} - 1)/(S - 1)$, and $p'(r)$ is such that (13)
is satisfied